Descriptive Statistics

Table 1. Median Vaccination Rate by County, USA, 11/10/2021
Locality    Median Vaccination Rate (%)    IQR
Metro       57.9                           (47.7–66.0)
Non-metro   49.5                           (42.3–57.5)
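The group-wise median and IQR in Table 1 can be reproduced with a short sketch. This is an illustrative Python/pandas version on synthetic data (the column names `locality` and `pct_vax` follow the report's Table 2; the values here are made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the county data (illustrative only).
rng = np.random.default_rng(0)
counties = pd.DataFrame({
    "locality": rng.choice(["Metro", "Non-metro"], size=500),
    "pct_vax": rng.normal(55, 12, size=500).clip(0, 100),
})

# Median and IQR of vaccination rate by locality, as in Table 1.
summary = counties.groupby("locality")["pct_vax"].agg(
    median="median",
    q1=lambda s: s.quantile(0.25),
    q3=lambda s: s.quantile(0.75),
).round(1)
print(summary)
```

The IQR column of Table 1 corresponds to the `(q1, q3)` pair for each locality.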

Below we display the scatter plots and linear regression overlays for all other covariates that did not have plots included in the final manuscript.

Figure 1. Scatterplot and linear regression overlay with median household income as the predictor and percent of county vaccinated as the outcome

Figure 2. Scatterplot and linear regression overlay with unemployment rate as the predictor and percent of county vaccinated as the outcome

Figure 3. Scatterplot and linear regression overlay with percent poverty as the predictor and percent of county vaccinated as the outcome

Modeling

Correlation Matrix

Median income and percent poverty are highly correlated (−0.77). Further, percent bachelor's is strongly correlated with median income (0.62). We decided to remove median income due to its strong correlation with both our main predictor and percent poverty.

Table 2. Correlation Matrix for all Variables Considered in Model
               pct_vax  pct_bachelors  unemployment  median_income  pct_poverty
pct_vax           1.00           0.46          0.19           0.40        -0.25
pct_bachelors     0.46           1.00         -0.04           0.62        -0.35
unemployment      0.19          -0.04          1.00          -0.14         0.31
median_income     0.40           0.62         -0.14           1.00        -0.77
pct_poverty      -0.25          -0.35          0.31          -0.77         1.00
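A correlation matrix like Table 2, followed by dropping the collinear predictor, can be sketched as follows. This is an illustrative Python/pandas version; the variable names come from Table 2, but the data are synthetic (the negative income–poverty relationship is built in for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data using the report's variable names (values are made up).
rng = np.random.default_rng(1)
n = 300
median_income = rng.normal(55_000, 12_000, n)
df = pd.DataFrame({
    "pct_vax": rng.normal(55, 12, n),
    "pct_bachelors": rng.normal(22, 8, n),
    "unemployment": rng.normal(5, 2, n),
    "median_income": median_income,
    # Constructed to correlate negatively with income, as in Table 2.
    "pct_poverty": 30 - median_income / 4_000 + rng.normal(0, 2, n),
})

# Pairwise Pearson correlations, as in Table 2.
corr = df.corr().round(2)
print(corr)

# Drop median_income before modeling, as described in the text.
model_df = df.drop(columns="median_income")
```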

LASSO (Resampling)

Before running the LASSO and decision tree models, we created a resample object for the training data using 5-fold cross-validation repeated 5 times. We then ran a LASSO model, training 30 penalized linear regressions in order to tune the penalty and select the best fit. Figure 4 visualizes the validation-set metrics by plotting RMSE against the range of penalty values. The plot shows that model performance is generally better at smaller penalty values, suggesting that most of the predictors are important to the model. We also see a steep increase in RMSE toward the highest penalty values: a large enough penalty removes all predictors from the model, so predictive accuracy drops. After tuning, all predictors remained in the model. The best-performing LASSO model had a penalty of 0.0452 and an RMSE of 12.2, indicating that it performed better than the null model. The optimal penalty is visible where the coefficient paths converge in the right-hand panel of Figure 4.
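The tuning procedure described above (repeated 5-fold cross-validation, a 30-value penalty grid, RMSE as the metric) can be sketched in Python with scikit-learn. This is an illustrative translation on synthetic data, not the authors' code; the grid bounds and data-generating values are assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the county covariates.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))           # e.g. pct_bachelors, unemployment, ...
y = 55 + 3 * X[:, 0] + rng.normal(0, 12, 400)

# 5-fold cross-validation repeated 5 times, as described in the text.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

# Tune the LASSO penalty over a 30-value grid, scoring by RMSE.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
grid = {"lasso__alpha": np.logspace(-4, 1, 30)}
search = GridSearchCV(pipe, grid, cv=cv,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
best_rmse = -search.best_score_
print(f"best penalty: {best_alpha:.4f}, CV RMSE: {best_rmse:.2f}")
```

Plotting `-search.cv_results_["mean_test_score"]` against the penalty grid would reproduce the shape of the tuning curve in Figure 4.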

Figure 4. Lasso Tuning Plots and Model Diagnostics

Decision Tree

We ran a decision tree model using the same repeated 5-fold cross-validation and tuned the hyperparameters to improve model performance. The best-performing decision tree had a cost complexity of 0.000562, a tree depth of 8, and an RMSE of 12.7, indicating that it performed better than the null model but not as well as the LASSO. We also used the decision tree to estimate variable importance (Figure 5): our main predictor, percent with a bachelor's degree, appears to be the most important variable, while locality is the least important.
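The tree tuning and variable-importance step can be sketched similarly. Again this is an illustrative scikit-learn version on synthetic data (the hyperparameter grids are assumptions; here the first column drives the outcome strongly, so it dominates the importances, mirroring the role of percent bachelor's in Figure 5):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data; column 0 plays the "main predictor" role.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = 55 + 10 * X[:, 0] + rng.normal(0, 3, 400)

# Tune cost complexity and tree depth with the same repeated 5-fold CV.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
grid = {"ccp_alpha": np.logspace(-5, -1, 10),
        "max_depth": [2, 4, 8, 15]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), grid,
                      cv=cv, scoring="neg_root_mean_squared_error")
search.fit(X, y)

# Variable importance from the best tree (the Figure 5 analogue).
importances = search.best_estimator_.feature_importances_
print(search.best_params_, importances)
```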

Figure 5. Decision Tree Plot of Important Variables and Model Diagnostics

Multivariate Linear Regression and Simple Linear Regression Model Diagnostics

We first calculated the RMSE for the null model, which was 14.5. We then ran a full model with all predictors and plotted diagnostics to compare fits with the null model. The RMSE for the full model was 12.1, indicating that it reduced the RMSE relative to the null model. We repeated these steps for a model with only the main predictor, obtaining an RMSE of 12.8, indicating that the simple model performed better than the null model but not as well as the full model. Below are the diagnostic plots for the full model and the simple model.
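The null/full/simple comparison above can be sketched in a few lines: the null model predicts the outcome mean everywhere, and each regression's RMSE is computed on the same training data. Illustrative Python on synthetic data (all values assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 4))          # full predictor set (synthetic)
y = 55 + 6 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 8, 400)

# Null model: predict the training mean everywhere.
rmse_null = np.sqrt(mean_squared_error(y, np.full_like(y, y.mean())))

# Full model: all predictors.
pred_full = LinearRegression().fit(X, y).predict(X)
rmse_full = np.sqrt(mean_squared_error(y, pred_full))

# Simple model: main predictor only (column 0).
pred_simple = LinearRegression().fit(X[:, :1], y).predict(X[:, :1])
rmse_simple = np.sqrt(mean_squared_error(y, pred_simple))

print(rmse_null, rmse_simple, rmse_full)
```

Because the models are nested, the in-sample ordering full ≤ simple ≤ null always holds, as it does for the report's 12.1 ≤ 12.8 ≤ 14.5.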

Figure 6. Model Diagnostics for Full Model and Simple Model on Train Data

Univariate Models with Other Predictors

We also ran univariate models for each of the other predictors, plotted diagnostics, and calculated the RMSE for each. The univariate models with unemployment (RMSE 14.2), poverty (RMSE 14.0), and locality (RMSE 14.2) all had RMSEs close to that of the null model (14.5), suggesting that none of these predictors adds much on its own.
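Fitting one univariate model per remaining predictor is naturally a loop. An illustrative Python sketch on synthetic data (the predictor names come from the report; here the outcome is generated nearly independently of them, so each univariate RMSE stays close to the null RMSE, as in the text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
names = ["unemployment", "pct_poverty", "locality"]   # names from the report
X = rng.normal(size=(400, 3))
y = 55 + rng.normal(0, 12, 400)        # outcome barely related to these

rmse_null = np.sqrt(mean_squared_error(y, np.full_like(y, y.mean())))
for j, name in enumerate(names):
    xj = X[:, [j]]                     # single-predictor design matrix
    pred = LinearRegression().fit(xj, y).predict(xj)
    rmse = np.sqrt(mean_squared_error(y, pred))
    print(f"{name}: RMSE {rmse:.2f} (null {rmse_null:.2f})")
```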

Figure 7. Model Diagnostics for Simple Models (Unemployment and Poverty) on Train Data

Final Model

After testing and plotting diagnostics on the models described above, we see that most of the models do not strongly predict our outcome or substantially reduce the RMSE. Taking all metrics and plots into consideration, we chose the simple univariate model with percent of the population with a bachelor's degree as the model to answer our main question and best fit the data. We ran a final fit on our test data using this simple model and present the performance statistics and diagnostic plots below. The RMSE for the simple model was 12.4 on the testing data, compared with 12.1 when fit to the training data. The residual plot for this model shows the residuals appropriately scattered with only a few outliers, suggesting that the model is a decent fit. The predicted vs. observed plot on the left shows that the points do not fall entirely on the 45-degree line, which is expected given the small reduction in RMSE for the simple model relative to the null.
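The final-fit step, training the chosen simple model on the training split and evaluating once on the held-out test split, can be sketched as follows. Illustrative Python on synthetic data; the generating intercept and slope roughly echo Table 4, but everything here is assumed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data roughly shaped like the report's simple regression.
rng = np.random.default_rng(6)
pct_bachelors = rng.normal(22, 8, 500)
pct_vax = 38 + 1.0 * pct_bachelors + rng.normal(0, 12, 500)

X = pct_bachelors.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, pct_vax, random_state=0)

# Fit the chosen simple model on the training split, score on the test split.
final = LinearRegression().fit(X_tr, y_tr)
pred = final.predict(X_te)
rmse_test = np.sqrt(mean_squared_error(y_te, pred))
residuals = y_te - pred                # inputs to the residual plot
print(f"test RMSE: {rmse_test:.2f}")
```

Plotting `pred` against `y_te` with a 45-degree reference line, and `residuals` against `pred`, gives the two panels of Figure 8.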

Figure 8. Left: Predicted vs. observed from the final model with only percent bachelor's degree on test data. Right: Residuals from the final model with only percent bachelor's degree on test data

Additional Descriptive Plots that Explore the Distribution of Each Variable

Table of Missingness for Each Variable in the Dataset

Table 3. Tables of Missingness Among Variables in Study
Variable Missing Percent
FIPS 0 0.00
pct_vax 56 1.78
locality 1 0.03
county 0 0.00
state 0 0.00
pct_bachelors 8 0.25
unemployment 0 0.00
median_income 0 0.00
pct_poverty 0 0.00
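A missingness table like Table 3 is a one-liner per column. Illustrative Python/pandas sketch; the 3,148-row total is an assumption chosen so that 56 missing values reproduce Table 3's 1.78% (the real dataset size is not stated in this section):

```python
import numpy as np
import pandas as pd

# Synthetic frame mirroring three of the study variables
# (3,148 rows assumed; counts chosen to match Table 3).
df = pd.DataFrame({
    "FIPS": range(3148),
    "pct_vax": [np.nan] * 56 + [50.0] * (3148 - 56),
    "locality": ["Metro"] * 3147 + [np.nan],
})

# Missing count and percent per variable, as in Table 3.
miss = pd.DataFrame({
    "Missing": df.isna().sum(),
    "Percent": (df.isna().mean() * 100).round(2),
})
print(miss)
```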

Simple Linear Regression Model with Only Main Predictor Run During Exploratory Analysis Phase

Note: This model was not run on the train data, but on all of the data, before we implemented ML methods.

Table 4. Results of the simple linear regression model with percent of county with a bachelor's degree as the predictor and percent of county vaccinated as the outcome
term estimate std.error statistic p.value
(Intercept) 38.495765 0.520051 74.02306 0
pct_bachelors 1.007582 0.034619 29.10487 0
Table 5. Performance statistics for the simple model with percent of county with a bachelor's degree as the predictor and percent of county vaccinated as the outcome
r.squared      0.21598
adj.r.squared  0.215725
sigma          12.6557
statistic      847.0936
p.value        0
df             1
logLik         -12174.83
AIC            24355.66
BIC            24373.76
deviance       492512.9
df.residual    3075
nobs           3077